ABSTRACT


Knowledge tracing is a popular and successful approach to 
modeling student learning. In this paper we investigate 
whether the addition of neuroimaging observations to a knowl- 
edge tracing model enables accurate prediction of memory 
performance in held-out data. We propose a Hidden Markov 
Model of memory acquisition related to Bayesian Knowledge 
Tracing and show how continuous functional magnetic reso- 
nance imaging (fMRI) signals can be incorporated as obser- 
vations related to latent knowledge states. We then show, 
using data collected from a simple second-language learning
experiment, that fMRI data acquired during a learning
session can be used to improve predictions about student 
memory at test. The fitted models can also potentially give 
new insight into the neural mechanisms that contribute to 
learning and memory. 


1. INTRODUCTION 


A shared goal for both cognitive science and educational 
data mining is the development of accurate models of hu- 
man learning. On the basic science side, learning and mem- 
ory are important functions of the human brain that support 
our ability to flexibly interact with our environment. On the 
education side, predictive theories of learning may be lever- 
aged by intelligent tutoring systems (ITS) to individually 
optimize instruction [3, 20]. 


Perhaps the most influential approach to modeling student 
learning in the educational data mining literature is “knowl- 
edge tracing” [5, 10] whereby the learned mastery of a par- 
ticular skill or fact is treated as a latent state and the proba- 
bility that a person’s knowledge is in that state is updated in 
light of observed student behavior. For example, in Bayesian 
Knowledge Tracing (BKT), each learning unit is assumed to 
be in one of two discrete states: {unknown, known}. Each
time the student engages in a learning activity, the latent
knowledge can transition from the unknown to the known
state with probability l. Performance on a test, quiz, or
exercise is conditional on the latent knowledge state, such
that being in the known state is typically associated with a
higher probability of issuing a correct answer than being in
the unknown state. Using the model, it is possible to infer
posterior probabilities of the knowledge state of each learner
and skill using Bayes’ rule, given the pattern of responses
made on various assessments or quizzes. These probabilities
are then used to make predictions about learning performance
for new students, as well as to design optimized
instruction policies.

*D. Halpern and §S. Tubridy contributed equally to the


Research in this area focuses on building more precise mod- 
els of student learning by, for instance, incorporating fac- 
tors that reflect individual abilities [41, 21], contextual fac- 
tors that contribute to errors [6], or models of the exact 
moment at which a skill is acquired [7]. However, one rel- 
atively underexplored question is what types of observable 
data may be most useful for informing inferences about la- 
tent knowledge states during learning. Of particular interest 
is the idea that many other features besides overt responses 
might be partially informative. For example, the student’s 
response time to a test question may add additional informa- 
tion about learning alongside correctness [24, 38, 40]. Like- 
wise, patterns of mouse or eye movements during a learning 
session might help index drifting attention [8, 27]. 


In this paper we demonstrate that it is possible to integrate 
indirect neural measurements of brain activity into a cogni- 
tive model of learning in a way that 1) can improve predic- 
tion of a learner’s test performance at a 72 hour delay and 2) 
allows knowledge tracing without interrupting the learning 
environment with explicit tests or assessments (which can 
be distracting or may bias learning). 


Although acquiring neural recordings is impractical in most 
educational settings, the approach of fusing multiple sources 
of sensor data about individual learners may be a generally 
useful method for the educational data mining literature. 
In addition, as we show in our results, such modeling efforts 
may also feed back to contribute to a better understanding of
the neural and cognitive mechanisms that support learning 
and memory [2, 1, 34, 35]. Finally, as the cost and difficulty
of making indirect neural recordings fall (e.g., due to
the advent of portable, dry contact electroencephalogram,
or EEG), the practicality of utilizing such sensors will likely
increase (cf. [14]).


We begin by reviewing past work in cognitive neuroscience 
which has attempted to identify predictive signals of learn- 
ing and memory processes. Next we describe our approach 
fusing concepts from knowledge tracing with what is known 
about the cognitive neuroscience of memory. We then de- 
scribe a dataset collected from human participants perform- 
ing a simple second-language learning task while undergoing 
functional magnetic resonance imaging (fMRI). We compare
the predictive power of a variety of models against held-out 
memory recall data at study-test delays ranging from one 
day to one week. From the fitted model we then extract the 
neural signals corresponding to learning in the study period. 


1.1 Prior work using cognitive neuroscience methods to predict individual learning

The prediction and optimization of human learning has been 
a long standing goal of cognitive neuroscience research. On 
the prediction side, a number of studies have explored the 
“subsequent memory” paradigm [28, 23, 13, 26]. In these 
experiments, participants study controlled stimuli such as 
lists of word pairs while brain signals (such as the blood 
oxygen-level dependent “BOLD” signal measured via fMRI
or event-related potentials, ERPs, assessed with EEG) are 
recorded. Some time later, participants’ memory is tested 
for the material they saw during study. Accuracy on each 
memory test item is used to back-sort the neural data record- 
ings into brain patterns associated with successful versus un- 
successful later memory. Regions with a reliable difference 
in brain activation between these two classes are taken to 
reflect neural correlates supporting lasting memory formation.
Across these studies a coherent set of brain regions
has been identified as being involved in human memory
formation including the hippocampus and medial temporal 
lobe, which have long been associated with memory forma- 
tion on the basis of animal and lesion studies [29, 9]. 


Building on this work, Fukuda et al. (2015) identified two 
EEG-based subsequent memory signals and used these to 
classify study trials in a memory experiment as likely to 
be remembered (initially well studied) or forgotten (initially
poorly studied). In a subsequent session, participants were 
allowed to restudy half of the items identified as initially 
well studied and half of the items identified as initially poorly 
studied. A final test then assessed knowledge for all of the 
items. Of particular interest was the finding that the restudy 
opportunity most benefitted the initially poorly studied items 
compared to the other items. Importantly, the entire pre- 
diction about what was or wasn’t well studied was based 
exclusively on indirect neural recordings for each subject 
rather than any explicit assessment or test. 


The subsequent memory paradigm has been a powerful tool 
for studying the neural basis of memory. However, the cog- 
nitive neuroscience literature does not currently take advan- 
tage of the wealth of knowledge about predicting individ- 
ual learning from the educational data mining and cognitive 
modeling literatures. For example, classifying brain pat- 
terns as forgotten based on a single test fails to account 
for the possibility of “slippage” (errors in performance of 
a mastered skill due to chance) which is central to BKT 
models [10]. Likewise, when an item is not remembered
at test it could be for a number of reasons: the item may
have been poorly encoded during the study session, or per- 
haps was well encoded and would have been remembered 
at an earlier study session but was simply forgotten due to 
decay or interference. Structured models such as Hidden 
Markov Models (HMMs) can account for such latent mem- 
ory dynamics and use them to help improve predictions. 
The subsequent memory approach is also difficult to apply 
when learners get repeated study opportunities because of 
ambiguity about which brain scans should be classified as 
causally related to the test performance. Finally, the standard
for model development within the machine learning
and data mining communities is predictive performance on
held-out data, which is often more difficult than describing
statistically reliable patterns within a single data set because
of the risk of overfitting.


To address these issues, we describe an approach to the si- 
multaneous modeling of behavior and neural recordings in a 
single knowledge tracing model¹. Our aim is to demonstrate
the value of combining insights from these still somewhat 
disparate literatures. The approach we take is in some ways 
similar to work by Anderson and colleagues that has tried 
to infer from fMRI the mental state of individuals as they 
engage in complex math problems [2, 1, 4, 42, 33] (see also 
[34, 35]). While these reports hint at the utility of combining
fMRI with probabilistic cognitive models, this prior
work does not specifically address the learning and memory 
issues considered here. 


2. THE OMNI DATA SET 

The dataset we consider, part of the NSF-funded “Optimizing
Memory using Neural Information” (OMNI) project²,
consists of human performance on a cued-recall memory 
test for a set of Lithuanian-English word translations. The 
learner’s task is to study the word pairs across multiple pre- 
sentations and then, after a delay, recall the English asso- 
ciate for a presented Lithuanian word. 


Starting with a normed set of Lithuanian-English words, we 
selected 45 translation pairs [19]. During study, participants 
saw the translation pairs presented one at a time for 4 sec- 
onds each with a variable duration inter-trial interval (4s-16s 
for consistency with event-related MRI timing). Words were 
presented on a computer screen with the Lithuanian word 
at the top of the screen and the English translation under- 
neath. 


Each word pair was presented five times and no pair was 
presented for the nth repetition until all words had n − 1
presentations. Importantly, and in contrast to many psychology
studies on the subsequent memory effect, all participants
see the same sequence of study items³. Immediately
following the study session participants gave judgments of
learning (JOLs, [22]): for each pair participants were presented
with the Lithuanian and English word and used the
computer mouse to indicate on a scale of 0-100 how likely
they were to remember the association in one week.

¹Here we focus on fMRI due to improved spatial resolution,
even though other methods (e.g., EEG and skin conductance
response) also provide useful signals that correlate
with memory performance and could be incorporated into
our approach.

²http://gureckislab.org/omni

³Although the models we apply do not explicitly model
inter-item interactions, maintaining a fixed sequence across
participants ensures that some of these inter-item effects will
be captured in the model parameters we estimate because,


Participants were given either an immediate recall test (0 
hours) or returned to the lab approximately 24, 72, or 168 
hours after the initial study session (randomly assigned)⁴.
During the recall test, participants saw a Lithuanian word 
presented on the screen and had to type the associated En- 
glish word. A trial was coded as correct if participants typed 
the correct English word (allowing for typographic errors) 
and all other responses were incorrect. 


For more efficient estimation of the different model parame- 
ters, we conducted a large behavioral experiment outside of 
the MRI scanner and combined those data with additional 
observations from participants who performed the same task 
during MRI scanning (under this view all participants are 
equally useful but purely behavioral subjects are treated as 
though their MRI data are “missing” and so estimates of 
their learning are based on the observed JOLs and recall 
performance). Each participant (N=189) was tested at one 
of the four study-test delays. Among the behavioral partici- 
pants (i.e., no MRI data) the group Ns were 20, 49, 60, and 
49 in the 0, 24, 72, and 168 hour study-test delay groups, 
respectively. All MRI participants (N=21) were tested at 
the 72 hour delay. 


MRI participants underwent an identical study-test proce- 
dure as the behavioral participants except they were scanned 
during the study session. MRI data were collected on a 
Siemens Prisma 3T at the New York University Center for 
Brain Imaging. Functional Blood Oxygen-Level Dependent 
(BOLD) data covering the cortex were acquired at a spatial 
resolution of 2.5 mm³ with a 1 second repetition time (TR;
the temporal resolution of the fMRI data) and anatomical
scans were collected at a spatial resolution of 0.75 mm³.


To summarize, the final data set consists of a record for each 
learner that contains: the pattern of recall attempts for each 
list item, JOLs collected after the study session for each list 
item, and, for each MRI participant, the 65x77x73 set of 
voxel measurements across 2936 time-points describing the 
BOLD signal recorded with MRI. 


Figure 1 shows key features of the behavioral data. Across 
the four different test delays, memory performance generally 
drops, likely due to forgetting. Participant performance var- 
ied widely from 0 to 100 percent correct. In addition, across 
participants, average JOLs following study were weakly correlated
with performance (r = [0.43, 0.24, 0.31, 0.55] and
p = [0.06, 0.10, 0.004, 3.4e-5] in the 0h, 24h, 72h, and 168h
groups, respectively). Pooling across all participants, the
mean JOL correlation with final performance is low but significant,
r = .365, p < 1e-7.


for instance, the measured difficulty of a word is always as- 
sessed with respect to the other list items. 

⁴Due to scheduling difficulties, one subject returned at 48
hours but we still included their data in the modeling. In 
addition, 9 of the 72 hour subjects were scanned in a different 
fMRI scanner but we only include their behavioral data here. 
Figure 1: Top: Mean recall performance (% correct)
for individuals (dots) at each study-test delay. Bottom: 
Mean individual participant Judgment of Learning is cor- 
related with individual overall percent recalled within 
each delay condition. 


3. INFERRING KNOWLEDGE STATES FROM BEHAVIORAL AND NEURAL DATA


The following section describes the basic mathematical struc- 
ture of our models. Similar to BKT, the core of our ap- 
proach assumes a probabilistic representation of the latent 
mnemonic status (e.g., remembered versus forgotten) of each 
item on the to-be-remembered list and we begin with es- 
tablished two- and three-state models that have shown ef- 
fectiveness in tracking learning and memory [5, 10]. Where 
our models differ from past knowledge tracing approaches is
that we propose a mapping between these latent mnemonic 
states and patterns of brain activity that can allow the brain 
data to inform this inference. 


3.1 A Hidden Markov Model of Memory 


Like BKT, our approach draws heavily from the structure 
of HMMs. Each memory trace, i, (i.e., memory for the
association between two words) is represented as a
non-homogeneous, censored Hidden Markov Model with the
following properties (notation follows [25]):


3.1.1 States 

Each trace can be in one of a number of discrete mnemonic 
states, S. For simplicity we will begin with a two-state
model, $S = \{s_U, s_K\}$, with states corresponding to unknown
and known, similar to BKT. However, we also consider a
more complex, three-state model first proposed by Atkinson
[5]. The three-state model has states $S = \{s_U, s_K, s_P\}$
corresponding to unknown, known (with possibility of forgetting),
and permanently known (see Figure 2). Across
both types of models the $s_K$ and $s_P$ states represent memories
that have generally higher recall probabilities (e.g.,
$\Pr[\text{recall} = \text{correct} \mid s_P] > 0$), but the $s_K$ state is susceptible
to decay between study events while the $s_P$ state is
absorbing⁵. The current state of item i at time t will be
denoted $q^i_t$.


3.1.2 Priors 


A prior, $\pi_{t=0}$, that captures our initial belief about the memory
state of all items. The prior for a particular item memory, i,
can be written as $\pi^{i,s}_{t=0} = \Pr[q^i_{t=0} = s]$ for $s \in \{s_U, s_K\}$ (two
state) or $s \in \{s_U, s_K, s_P\}$ (three state). With unfamiliar
learning materials we assume that the initial memory status
is heavily biased towards the unknown state (i.e., $\pi^{i,s_U}_{t=0}$ is
much higher than for any other state).


3.1.3 Transitions 

A set of transition probabilities, A, which determine the
likelihood that a memory will move between the different
states at each time point. In prototypical HMMs the transition
probabilities are stationary and the same transitions are
applied at each time step. In our model there are different
sets of transition probabilities, and the set applied at a given
time step depends on the type of external “event”, $e^i_t$, that
occurs (e.g., a study trial versus a time step between trials;
Figure 2). For memory trace i the transition probability of
moving from state s to s' after an event of type g will be
denoted $a^{i,g,s \to s'} = \Pr[q^i_t = s' \mid e^i_t = g, q^i_{t-1} = s]$, where g
indicates the specific event type on trial t.


Event types depend on the particular experiment design but 
here include “study trial” (study), “study with JOL trial” 
(study+JOL), “timestep in which memory decays” (decay), 
and “test trial” (test). Generally, during study or study+JOL
events we assume that items tend to transition from a more
poorly learned state to a more fully learned state. The probability
of transitioning to a new state on a study trial is represented
in our three-state model by parameters x, y and z
and in the two-state model by parameter l (see Figure 2).
During decay, items in a non-permanent state ($s_K$) transition
to the unknown state with probability f, while items in
$s_P$ (in the three-state model) remain
in the permanently learned state. Decay events are necessary
to account for the patterns of forgetting across the
study-test delay intervals shown in Figure 1. We assume 
test trials have no effect on transitions as they appear at the 
end of the task. 
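
As a rough sketch of the event-dependent transitions described above, the matrices might be laid out as follows. Figure 2 is not reproduced in this text, so the cell placement is an assumption: rows are the starting state, there is no forgetting during study, x and y leave the unknown state, and z moves known to permanently known. The function names are hypothetical.

```python
def two_state_transitions(l, f):
    """Transition matrices for the two-state model, state order (U, K).

    Rows are the starting state and columns the destination state.
    Assumed layout: study moves U -> K with probability l; decay moves
    K -> U with probability f.
    """
    return {
        "study": [[1 - l, l],
                  [0.0, 1.0]],
        "decay": [[1.0, 0.0],
                  [f, 1 - f]],
    }


def three_state_transitions(x, y, z, f):
    """Transition matrices for the three-state model, state order (U, K, P).

    Assumed layout: on study, U -> K with probability x, U -> P with
    probability y, and K -> P with probability z; on decay only K can
    fall back to U (probability f) and P is absorbing.
    """
    return {
        "study": [[1 - x - y, x, y],
                  [0.0, 1 - z, z],
                  [0.0, 0.0, 1.0]],
        "decay": [[1.0, 0.0, 0.0],
                  [f, 1 - f, 0.0],
                  [0.0, 0.0, 1.0]],
    }
```

Under this layout a study+JOL event would reuse the study matrix and a test event would use the identity matrix, since test trials are assumed to leave the state unchanged.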


Figure 2: The matrix of transition probabilities for either
study or decay events in the two and three state
model. The letters within each matrix reflect the transition
parameters which are estimated from data. The state
labels U are “unknown”, K are “known” (with possible
forgetting), and P are “permanently known.”

⁵One way for the model to capture the difference in performance
at 24 versus 168 hours is to assume different mixtures
of the $s_K$ and $s_P$ states following learning. For example,
at 168 hours, traces in the $s_P$ state may dominate correct responses.

We define an experiment protocol, E, as an N × T matrix
where N is the number of items being studied and T is the
total number of micro-time steps modeled in the experiment.
Each entry of the matrix, $e^i_t$, codes which of a discrete set of
event types occurred on a time step as described above. The
protocol captures the dependencies between event sequences
that influence different memory traces. For example, if word
w is studied on a given trial, then all the other items on
the list might undergo a memory decay event during the
same time step. This way the protocol enforces the implicit
tradeoffs of studying one item over others at a particular
point in time.
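
A minimal sketch of how such a protocol matrix could be assembled is shown below. The schedule format (a mapping from time step to the item studied at that step) and the function name are assumptions made for illustration; every cell that is not a study event defaults to a decay event, which is one simple reading of the description above.

```python
def build_protocol(study_schedule, n_items, n_steps):
    """Build an N x T protocol of event labels.

    study_schedule maps a time step to the index of the item studied at
    that step (hypothetical input format). All other cells are marked
    as decay events, so studying one item implies decay for the rest.
    """
    protocol = [["decay"] * n_steps for _ in range(n_items)]
    for step, item in study_schedule.items():
        protocol[item][step] = "study"
    return protocol


# Two items over twelve time steps: item 0 studied at step 5, item 1 at step 9.
example = build_protocol({5: 0, 9: 1}, n_items=2, n_steps=12)
```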


3.1.4 Observable signals 

The mapping between brain and behavior is made through 
a set of observation distributions, B, which define the 
probabilities that, on event type g at time t, an observable
random variable of data type d, $o^{i,d}_t$, takes on a value $v^{g,d}_k$
from a (potentially infinite) alphabet $V^{g,d}$. For each memory
trace i, we can write the probability of its associated observables
as $b^{i,g,s,d}_t(v^{g,d}_k) = \Pr[o^{i,d}_t = v^{g,d}_k \mid e^i_t = g, q^i_t = s]$. Observation
distributions in effect define the full generative model
that links both behavior and neural information to underlying
knowledge states. Here we consider three types of observations:
behavioral assessments (recall), JOLs (JOL), and
hemodynamic fMRI measurements (MRI). However, this
approach can easily incorporate many other measures including
response time, pupil dilation, EEG measurements,
or alternative fMRI signals.


Behavioral Assessments. At certain points during the 
experiment the protocol might define a memory test event. 
On these types of trials the subject might be asked to re- 
call a studied item from memory or to recognize it from 
a list of alternatives. The response given on these trials 
is treated as an observation associated with this particular
type of event. Specifically, the alphabet is
$v^{\text{test},\text{recall}} \in \{\text{correct}, \text{incorrect}\}$
and $v^{g,\text{recall}} \in \emptyset$ for $g \neq \text{test}$, reflecting
the absence of any recall response on non-test events. The
distribution of test question answers about memory trace i
from state s at time t is then $b^{i,\text{test},s,\text{recall}}_t(\text{correct}) = p_{\text{recall}_s}$
and $b^{i,\text{test},s,\text{recall}}_t(\text{incorrect}) = 1 - p_{\text{recall}_s}$, where $p_{\text{recall}_s}$ is defined
(or fitted) for each memory state. For other trial types,
i.e. $g \neq \text{test}$, $b^{i,g,s,\text{recall}}_t(\cdot) = 1$, so the update to state posterior
probabilities on those events is driven by the state
transitions. The parameters governing the probability of issuing
a correct response conditioned on the latent memory
state are equivalent to the “guess” and “slip” parameters in
BKT.


Judgments of Learning. JOL responses were only given 
on the last study trial (a study+JOL event). JOL data were 
included in the model as the raw response/100 to each JOL
trial for each person, i.e. $v^{\text{study+JOL},\text{JOL}} \in [0,1]$ and null
for other trial types. We model the distribution of JOLs as
a truncated Gaussian distribution in the range 0 to 1, i.e.
$b^{i,\text{study+JOL},s,\text{JOL}}_t = TN(\mu_{\text{JOL}_s}, \sigma_{\text{JOL}_s}, 0, 1)$ with $\mu_{\text{JOL}_s}$ and
$\sigma_{\text{JOL}_s}$ defined independently for each state s.
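
To make the truncated Gaussian JOL likelihood concrete, the sketch below evaluates its density on [0, 1] using only the standard library. The function names and the example means and standard deviations are illustrative assumptions, not fitted values from this study.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of an untruncated Gaussian."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def normal_cdf(x, mean, sd):
    """Cumulative distribution of an untruncated Gaussian."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def truncated_normal_pdf(x, mean, sd, low=0.0, high=1.0):
    """Density of a Gaussian truncated to [low, high].

    The untruncated density is rescaled by the probability mass that
    falls inside the interval; values outside it have zero density.
    """
    if x < low or x > high:
        return 0.0
    mass = normal_cdf(high, mean, sd) - normal_cdf(low, mean, sd)
    return normal_pdf(x, mean, sd) / mass


# A JOL of 80 on the 0-100 scale enters the model as 0.8.
jol = 80 / 100
# Hypothetical state-specific parameters (not estimates from the paper).
likelihood_unknown = truncated_normal_pdf(jol, mean=0.3, sd=0.25)
likelihood_known = truncated_normal_pdf(jol, mean=0.7, sd=0.25)
```

With these made-up parameters a high JOL is more likely under the known state, so it would shift the posterior toward that state when combined with the transition dynamics.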


Hemodynamic fMRI measurement. Functional MRI 
scans provide time-series data for each of a large set of 3- 
dimensional voxels tiling the imaged volume (e.g., the brain). 
In studies measuring fMRI activation levels at specific timepoints
it is common to estimate the activation level within
voxels and then average voxels within spatial clusters, whether 
spatially contiguous (regions of interest, or ROIs) or sets 
of spatially disjoint but functionally related voxels show- 
ing similar response profiles (e.g., independent components). 
Due to the central limit theorem we can expect that the 
mean activation within a set of such voxels will be approx- 
imately normal. We also expect, based on prior work, that 
there will be a mean shift in the fMRI activation levels of various
brain regions during study trials that are later remembered
compared to those that are later forgotten [13, 26]. We
collect fMRI data for each study trial. The fMRI observation
consists of $N_{\text{MRI}}$ features. Therefore, $v^{\text{study},\text{MRI}} \in \mathbb{R}^{N_{\text{MRI}}}$
and null otherwise. We model the fMRI state observation
distributions as independent Gaussians for each feature $n_j$,
i.e. $b^{i,\text{study},s,\text{MRI}_{n_j}}_t = N(\mu_{\text{MRI}_{n_j},s}, \sigma_{\text{MRI}_{n_j},s})$.

3.1.5 Inference 

The full model is specified by a protocol, E, a set of priors
over the states, $\pi_{t=0}$, a set of transition probabilities, A, and
a set of observation distributions associated with each state-event
pair, B. Using Bayes’ rule, the posterior probability
that a memory trace on trial t is in state $s' \in S$ is:

$$\pi^{i,s'}_t = \frac{b^{i,g,s',d}_t(o^{i,d}_t) \sum_{s \in S} a^{i,g,s \to s'} \, \pi^{i,s}_{t-1}}{\sum_{s'' \in S} b^{i,g,s'',d}_t(o^{i,d}_t) \sum_{s \in S} a^{i,g,s \to s''} \, \pi^{i,s}_{t-1}} \qquad (1)$$
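
The update can be sketched as a single filtering step: propagate the previous posterior through the event's transition matrix, weight each state by the likelihood of the current observation, and normalize. This is the standard HMM forward recursion written under the assumption that rows of the transition matrix index the starting state; the function name and example numbers are illustrative.

```python
def forward_step(previous, transition, likelihood):
    """One posterior update for a single memory trace.

    previous:   posterior over states at time t-1
    transition: matrix for the event at time t, rows = from, cols = to
    likelihood: probability of the observation at time t under each
                state (all ones when nothing is observed)
    """
    n_states = len(previous)
    predicted = [
        sum(previous[s] * transition[s][s_next] for s in range(n_states))
        for s_next in range(n_states)
    ]
    weighted = [likelihood[s] * predicted[s] for s in range(n_states)]
    total = sum(weighted)
    return [w / total for w in weighted]


# Two-state example, state order (U, K), with an assumed learning rate of 0.4.
study = [[0.6, 0.4], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
after_study = forward_step([0.99, 0.01], study, [1.0, 1.0])
# A correct recall with p_recall = [.01, .9] and no state change on test trials.
after_correct_test = forward_step(after_study, identity, [0.01, 0.9])
```

When several data types are observed on the same event, one option is to pass the product of their state-conditional likelihoods as `likelihood`, which assumes the observations are conditionally independent given the state.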
3.1.6 Illustrative calculation

To illustrate the impact of hypothetical fMRI observations, 
consider Figure 3 which shows the protocol, E, for the tim- 
ing of study events for two memory traces (Panel A): item 
1 (black) and item 2 (white). On time points where item 
1 is studied the protocol has a black cell (and similarly for
item 2 using white). Panel B shows hypothetical evolutions
over time for the two-state posterior probabilities $\{s_U, s_K\}$
for item 1 obtained by applying the study and forgetting
transitions as shown in Figure 2 but without other observable
information (i.e., a Markov model). In this example we
set the l transition parameter applied on study events to 0.4
and the f parameter governing decay to 0.1.

Figure 3: Example illustration of the effect of fMRI observations
on inferences about latent knowledge in a two
state-model. A) Protocol showing the timing of study
events for item 1 (black boxes) and item 2 (white boxes).
B) State posterior estimates for item 1 obtained from a
hypothetical setting of the two-state model parameters
(dashed blue = $s_U$, solid orange = $s_K$). C) Hypothetical
“observed” fMRI signal on each study trial for item 1
(inset shows the probability density function over MRI
observation values conditioned on the state). D) State
posteriors for item 1 after incorporating the observation
likelihoods from study trials for this item. The inferred
state probabilities are dramatically altered by the incorporation
of the MRI observation (see text).


At time point 1 the priors reflect the fact that before any 
study attempts a person is unlikely to know the item (e.g., 
$\pi^{1,s_U}_{t=0} = .9$). At time point 6, item 1 is presented for study
for the first time and the posterior probabilities of each state
are updated by applying the study transition probabilities
to the state posteriors at time t − 1. Immediately after this
study event, Panel B shows that there is now an increased
probability of item 1 being in state $s_K$ (solid orange line).
However, between time point 6 and 40, item 1 is not pre- 
sented again and so for each time step between we apply the 
decay transitions leading to gradual forgetting. 


The addition of observable signals that are probabilistically 
related to latent memory states alters these predictions. The 
inset figure in Panel C shows how the mean response from a 
set of voxels in the human brain might result in Gaussian-distributed
summed BOLD signals that overlap but differ
depending on the state of the memory (e.g., signal being
stronger for $s_K$, orange, than for the $s_U$, blue, state). Panel
C illustrates a hypothetical sequence of fMRI measurements
that could be made about item 1 during the study trials
(i.e., samples from the Gaussian distributions from the inset
plot).


Panel D shows the posterior estimates of item 1’s state at 
each time point obtained through combination of the transi- 
tion dynamics and MRI observations (i.e., using the Hidden 
Markov Model). As can be seen comparing panel B and D, 
the addition of observations that are probabilistically associ- 
ated with latent states can lead to different inferences about 
the posterior probabilities over those states. Until item 1 
is presented at time point 6 the posterior estimates are the 
same in the Markov and Hidden Markov Models. However, 
at time point 6 we observe an fMRI signal of a particular
magnitude which in turn has a likelihood of originating from
each of the two underlying states. If we take into account
the observed signal, our estimates of the posterior over states
change, since a fairly small signal was observed and the likelihood
of such a signal is substantially larger for state $s_U$
than $s_K$. Consequently, our belief that the item is in state
$s_K$ is lower when we include the observation in our estimates
than when we simply use the transition probabilities. 


Similarly, at time point 40 item 1 is presented for a second 
study opportunity. Without observations our best estimate 
of the state probabilities suggests we should be indifferent
between $s_U$ and $s_K$, but the larger MRI observation
is unlikely to have emerged from the unknown state and so
the observation-constrained posterior estimates are weighted
much more heavily towards the $s_K$ state. By including the
Markov dynamics characterizing the likely temporal evolu- 
tion of memories, we can adjudicate between otherwise am- 
biguous neural signals by appropriately dealing with uncer- 
tainty in measurement. 
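
The qualitative effect described in this example can be reproduced with a few lines of arithmetic. The sketch below applies one study event to a prior of .1 on the known state with l = 0.4, first without and then with a Gaussian fMRI observation. The state-conditional means, the standard deviation, and the two signal values are made-up numbers, and the decay steps between time points are omitted for brevity.

```python
import math


def gaussian_pdf(x, mean, sd):
    """Density of a Gaussian observation model."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def study_posterior(prior_known, l, signal=None, mean_u=0.0, mean_k=2.0, sd=1.0):
    """Probability of the known state after one study event.

    Without a signal this is the Markov prediction; with a signal the
    prediction is reweighted by the Gaussian likelihood under each
    state (hypothetical means mean_u and mean_k) and renormalized.
    """
    predicted_known = prior_known + (1.0 - prior_known) * l
    if signal is None:
        return predicted_known
    weight_known = gaussian_pdf(signal, mean_k, sd) * predicted_known
    weight_unknown = gaussian_pdf(signal, mean_u, sd) * (1.0 - predicted_known)
    return weight_known / (weight_known + weight_unknown)


markov_only = study_posterior(0.1, 0.4)                  # transitions alone
weak_signal = study_posterior(0.1, 0.4, signal=0.2)      # small observed signal
strong_signal = study_posterior(0.1, 0.4, signal=2.5)    # large observed signal
```

A small signal pulls the estimate for the known state below the Markov prediction and a large signal pushes it above, mirroring the contrast between Panels B and D.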


3.1.7 Model Evaluation and Fitting Procedure 
The following section details the model evaluation, compar- 
ison, and feature selection strategies we used. 


Model parameterization. Partially due to identifiability 
concerns [37, 16], some parameters were fixed to semanti- 
cally coherent values [15], while others were estimated from 
the data. 


For all words we fixed the initial state priors, $\pi_{t=0}$, as [.99, .01]
for $s_U$ and $s_K$ in the two-state model or [.99, .005, .005] for
$s_U$, $s_K$, and $s_P$ in the three-state model. This
was motivated by the fact that none of the learners in our
dataset had prior experience with Lithuanian. We also fixed
the probabilities of giving the correct test response, $p_{\text{recall}}$,
as [.01, .9] for latent memory states $s_U$ and $s_K$ (two-state
model) or [.01, .9, .9] for $s_U$, $s_K$, and $s_P$ (three-state model,
see below). This reflects the assumption that
it is very unlikely that one would guess the correct answer
in a cued recall test without any memory ($s = s_U$) and that,
as in [5], the primary difference between $s_K$ and $s_P$ in the
three-state model is the susceptibility to decay over time
(via the influence of the f parameter; see Figure 2)
rather than the availability of a memory to recall.


Fitted parameters include those determining the transition
probabilities and observation distributions within each model.
Both the two- and three-state models have transition probabilities
to fit for each word pair w (summarized in Figure 2).
In the two-state model these are the $l_w$ and $f_w$ parameters
controlling memory strengthening and decay, respectively.
For the three-state models, the $x_w$, $y_w$, and $z_w$ values control
transitions between states during study opportunities
and the $f_w$ parameter determines forgetting rates.


Although the learning trajectories for each word pair were 
instantiated in separate HMMs, to get better estimates of 
the parameters we used a hierarchical Bayesian model that 
used group-level priors over the parameters to regularize the 
estimates. Each $z_w$ was drawn from a Logit-Normal($z$, $\sigma_z$)
where $z$ itself was drawn from a Normal(0, 6) and $\sigma_z$ was
drawn from a Truncated-Normal(0, 1). The model for the
$f_w$ parameters was exactly the same. The simplices $xy_w$
were generated using the following procedure: $x$ and $y$ were
drawn from a Normal(0, 6). $x_w$ and $y_w$ were drawn from
Normal($x$, $\sigma_x$) and Normal($y$, $\sigma_y$) respectively with $\sigma_x$
and $\sigma_y$ both drawn from a Truncated-Normal(0, 1). Finally,
$xy_w$ was set to softmax((0, $x_w$, $y_w$)). This can be thought of
as a multivariate generalization of the Logit-Normal with a
diagonal covariance matrix.
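
The generative structure of these group-level priors can be sketched by forward sampling. Several details below are assumptions rather than statements from the text: the second argument of each Normal is treated as a standard deviation, Truncated-Normal(0, 1) is read as truncated below at zero (sampled as the absolute value of a standard normal draw), and the assignment of the simplex to transitions out of the unknown state follows our reading of the parameter names. The function names are hypothetical.

```python
import math
import random


def softmax(values):
    """Map real values to a probability simplex (max-shifted for stability)."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(value):
    """Inverse logit, turning a Normal draw into a Logit-Normal draw."""
    return 1.0 / (1.0 + math.exp(-value))


def sample_word_parameters(rng, n_words):
    """Draw per-word study parameters from the hierarchical prior."""
    # Group-level means and spreads (assumed standard-deviation scale).
    x_mean, y_mean, z_mean = (rng.gauss(0.0, 6.0) for _ in range(3))
    sigma_x, sigma_y, sigma_z = (abs(rng.gauss(0.0, 1.0)) for _ in range(3))
    words = []
    for _ in range(n_words):
        x_w = rng.gauss(x_mean, sigma_x)
        y_w = rng.gauss(y_mean, sigma_y)
        words.append({
            # Simplex over three outcomes, with the first logit pinned to 0.
            "xy_simplex": softmax([0.0, x_w, y_w]),
            # Single probability for the remaining study transition.
            "z": sigmoid(rng.gauss(z_mean, sigma_z)),
        })
    return words


draws = sample_word_parameters(random.Random(0), n_words=45)
```

Pinning the first logit to zero removes the redundancy in the softmax, so two free values per word determine a three-way probability vector.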


When fitting models that incorporated JOLs or fMRI data
we also estimated the mean and variance parameters for the
Gaussian (truncated for JOLs) observation likelihood from
each latent state. For the JOL distributions, each μ_JOL,s was
drawn from a Normal(.5, .5) and each σ_JOL,s was drawn from
an Inverse-Gamma(1, 2). Similarly, for each fMRI feature n_i
(see below) in state s, μ_fMRI_i,s was drawn from a Normal(0,
1) and σ_fMRI_i,s was drawn from an Inverse-Gamma(1, 2).
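To make the role of the state-specific Gaussian likelihoods concrete,
the following sketch shows a single Bayes update of a state
distribution from one trial's fMRI features. Treating the features as
conditionally independent given the state, and the function and
argument names, are assumptions of this illustration.

```python
import numpy as np


def gaussian_logpdf(x, mu, sigma):
    """Log density of a Normal(mu, sigma) evaluated at x."""
    return -0.5 * np.log(2.0 * np.pi * sigma ** 2) - 0.5 * ((x - mu) / sigma) ** 2


def update_with_fmri(prior, features, mu, sigma):
    """Bayes update of a state distribution from one trial's fMRI features.

    prior:     (n_states,) state probabilities before the observation
    features:  (n_features,) observed component activations
    mu, sigma: (n_features, n_states) per-state Gaussian parameters
    """
    prior = np.asarray(prior, dtype=float)
    features = np.asarray(features, dtype=float)
    # Sum log-likelihoods over features (assumed independent given the state).
    log_lik = gaussian_logpdf(features[:, None], mu, sigma).sum(axis=0)
    log_post = np.log(prior) + log_lik
    log_post -= log_post.max()  # stabilize before exponentiating
    post = np.exp(log_post)
    return post / post.sum()


# One hypothetical feature whose mean is 0 in s_U and 1 in s_K.
posterior = update_with_fmri(
    prior=[0.5, 0.5],
    features=[1.0],
    mu=np.array([[0.0, 1.0]]),
    sigma=np.array([[1.0, 1.0]]),
)
```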


fMRI feature selection. After standard MRI preprocess- 
ing [11], we selected data for inclusion in the model. We 
reduced the dimensionality of the fMRI data with group
spatial independent components analysis (ICA) using the
ICASSO algorithm as implemented in the GIFT ICA tool- 
box (http://mialab.mrn.org/software/gift/) [?, ?]. This pro- 
cedure, which is blind to trial information and memory out- 
come, resulted in a set of 60 independent components that 
are characterized by a particular temporal (the timecourse 
of activation) and spatial (the loading of each component on 
fMRI voxels) profile for each participant. Components that 
were unstable across estimations (ICASSO) and components 
associated with signal from ventricles or motion were dis- 
carded leaving 43 independent components for inclusion as 
model features. Individual trial activations for each identi- 
fied component were summarized as the mean of timepoints 
encompassing 4-6 seconds post-stimulus onset (to account 
for the temporal lag in the BOLD response), resulting in 
one activation value for each trial in each component for 
each MRI participant. 
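The trial-level summary described above can be sketched as a windowed
mean over a component timecourse. The repetition time (tr) and the
function name are assumptions, since the acquisition parameters are not
restated here; the 4-6 second window is the one given in the text.

```python
import numpy as np


def trial_activation(timecourse, onsets, tr, window=(4.0, 6.0)):
    """Mean component activation in a post-stimulus window for each trial.

    timecourse: (n_volumes,) component timecourse, one value per volume
    onsets:     stimulus onset times in seconds
    tr:         repetition time in seconds (assumed value supplied by caller)
    """
    timecourse = np.asarray(timecourse, dtype=float)
    times = np.arange(len(timecourse)) * tr  # acquisition time of each volume
    values = []
    for onset in onsets:
        mask = (times >= onset + window[0]) & (times <= onset + window[1])
        # NaN marks trials whose window falls outside the recorded run.
        values.append(timecourse[mask].mean() if mask.any() else np.nan)
    return np.array(values)


# Toy timecourse sampled every 2 s, with stimuli at 0 s and 10 s.
activations = trial_activation(np.arange(10.0), onsets=[0.0, 10.0], tr=2.0)
```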


Model estimation. We used MCMC sampling via the
NUTS algorithm as implemented in Stan [31] to estimate
the posterior over the parameters (4 chains of 200 iterations;
100 per chain discarded as burn-in; 400 total samples
per parameter). To assess convergence, we checked that
estimates of the probability of recall had low R-hat values,
i.e., values close to 1 (a measure of whether the sampling
chains are converging to similar estimates) [32, 7].
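For readers unfamiliar with the convergence diagnostic, a basic version
of the potential scale reduction factor can be computed as below. This
is a sketch of the textbook (non-split) formula; Stan reports a refined
variant, so values will not match its output exactly. The simulated
chains are synthetic and serve only to show the behavior.

```python
import numpy as np


def r_hat(chains):
    """Basic potential scale reduction factor for an (n_chains, n_draws) array."""
    chains = np.asarray(chains, dtype=float)
    n = chains.shape[1]
    within = chains.var(axis=1, ddof=1).mean()     # W: mean within-chain variance
    between = n * chains.mean(axis=1).var(ddof=1)  # B: between-chain variance
    var_plus = (n - 1) / n * within + between / n  # pooled variance estimate
    return float(np.sqrt(var_plus / within))


rng = np.random.default_rng(1)
mixed = rng.normal(size=(4, 100))  # 4 chains, 100 retained draws each
stuck = mixed + np.array([[0.0], [0.0], [3.0], [3.0]])  # two chains offset
```

Chains that sample the same distribution give a value near 1, whereas
chains that settle in different regions inflate the between-chain term
and push the value well above 1.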


Model evaluation. In order to compare models, we want 
to evaluate how well our models will predict new, unseen 
data. It is generally agreed that the generalization method
with the fewest assumptions is leave-one-out cross valida- 
tion, which is preferred when sufficient data and computa- 
tional resources are available [39]. To conserve on computa- 
tional resources, here we use K-fold cross validation, setting 
K to 10. Because our goal is to assess the utility of incor- 
porating MRI signals into a memory model, the held-out 
data only included data from the 20 fMRI subjects. We di- 
vided up the data from these subjects into ten equally sized 
folds. We then trained ten versions of each model where the 
training set consisted of all of the data from behavior-only 
subjects and nine of the ten folds of the fMRI subjects. On
the held-out test set, we used the identity of the words and
the trial timings (and JOL or fMRI observations, where
appropriate) to generate the posterior probability of recall for
each held-out word at the time of test.
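The fold construction can be sketched as follows. How items were
assigned to folds is not specified in detail, so the random assignment
of individual items, the seed, and the item count (20 subjects times 45
word pairs) are assumptions of this illustration. Data from
behavior-only subjects would be added to every training set.

```python
import numpy as np


def make_folds(n_items, k=10, seed=0):
    """Randomly partition item indices into k near-equal folds."""
    order = np.random.default_rng(seed).permutation(n_items)
    return np.array_split(order, k)


def train_test_indices(folds, held_out):
    """Split fMRI-subject items into training and held-out indices for one fold."""
    test = folds[held_out]
    train = np.concatenate([f for i, f in enumerate(folds) if i != held_out])
    return train, test


# Hypothetical count: 20 fMRI subjects x 45 word pairs.
folds = make_folds(n_items=900, k=10)
train_idx, test_idx = train_test_indices(folds, held_out=0)
```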


As we are primarily interested in our ability to classify a
new piece of data as successfully recalled or not, rather than
in the log likelihood of the trial under the model, we adopted
a cross-validated area under the ROC curve metric (ROC-AUC).
The ROC-AUC can be interpreted somewhat like
an accuracy measure, where 0.5 represents chance prediction
and higher values indicate better predictive performance.
Using ROC-AUC allows us to compare the held-out
predictive performance of models with varying numbers
of parameters while providing a metric of model performance
that is relatively insensitive to class imbalance and does not
prioritize one kind of error over another (e.g., trading off Hits
versus Misses). The model ROCs were defined by calculating,
in each cross-validation fold, the proportion of correctly
recalled trials that were predicted as remembered (Hits) and
the proportion of non-recalled trials that were predicted as
remembered (False Alarms) at each level of posterior recall
probability given by the model.
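A minimal sketch of the ROC-AUC computation for a single fold is given
below, using the standard definitions of hit rate (fraction of recalled
trials predicted as remembered) and false-alarm rate (fraction of
non-recalled trials predicted as remembered) at each threshold. The
function name and the toy inputs are illustrative assumptions.

```python
import numpy as np


def roc_auc(probs, recalled):
    """Area under the ROC curve from predicted recall probabilities.

    probs:    predicted probability of recall for each held-out trial
    recalled: boolean outcome for each trial (True if recalled)
    """
    probs = np.asarray(probs, dtype=float)
    recalled = np.asarray(recalled, dtype=bool)
    thresholds = np.unique(probs)[::-1]  # sweep from strictest to most lenient
    hit = [0.0]
    false_alarm = [0.0]
    for t in thresholds:
        predicted = probs >= t
        hit.append(predicted[recalled].mean())
        false_alarm.append(predicted[~recalled].mean())
    hit = np.array(hit)
    false_alarm = np.array(false_alarm)
    # Trapezoidal area under the (false alarm, hit) curve.
    return float(np.sum(np.diff(false_alarm) * (hit[1:] + hit[:-1]) / 2.0))


auc_perfect = roc_auc([0.9, 0.8, 0.3, 0.2], [True, True, False, False])
auc_partial = roc_auc([0.9, 0.4, 0.6, 0.2], [True, True, False, False])
```

The reported values would then be the mean and standard error of this
quantity across the ten folds.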


Model Comparison. We fit three variants of each of the 
two- and three-state models: a Recall model fit to trial tim- 
ing and recall performance (the binary recall success scores 
for each word); a model fit to trial timing, recall perfor- 
mance, and JOL observations (Recall+JOL); and a model 
fit to trial timing, recall performance, and fMRI observa- 
tions (Recall+MRI). In each case the training data included 
data from all of the behavioral participants and a subset of 
the MRI participant data, and models were evaluated on 
held-out data. The logic of these comparisons is to see if the 
models incorporating additional observations (Recall+JOL 
and Recall+MRI) provide a better basis for prediction than 
do the purely behavioral models. In addition, we are inter- 
ested in whether the model incorporating MRI observations 
is able to outperform the model incorporating JOLs. This 
would suggest that the brain data contain more information
relevant to memory performance than do people’s own
self-reports about their memory fidelity. While we are
ultimately interested in held-out predictive performance, the
models do differ in complexity. In raw numbers, for
the two-state models, the Recall model had 2 x 45 word
parameters and 4 hyperparameters, the Recall+JOL model
added 4 parameters, and the Recall+MRI model added
4N_fMRI parameters. For the three-state models, the Recall
model had 4 x 45 word parameters and 7 hyperparameters,
the Recall+JOL model added 6 parameters, and the
Recall+MRI model added 6N_fMRI parameters. However,
due to the hierarchical nature of these models, the effective
number of parameters may have differed depending on the
amount of regularization done by the hierarchical prior.


Table 1: Cross-validated area under the curve of
the receiver operating characteristic (ROC-AUC),
with ± standard error across folds in parentheses.


           | two-state model | three-state model
Recall     | 0.64 (.02)      | 0.64 (.02)
Recall+JOL | 0.73 (.01)      | 0.73 (.01)
Recall+MRI | 0.72 (.02)      | 0.75 (.01)


4. RESULTS 


4.1 Two-state model 

For each variant of the two-state model (Recall, Recall+JOL, 
Recall+MRI) we computed the ROC-AUC for predictions of 
recall accuracy in held-out trials for the MRI participants. 
The Recall model, trained on the timing of study and test
trials and recall performance, achieved a mean (across
held-out folds) ROC-AUC of 0.64 (±.02), providing an
above-chance baseline model against which to evaluate the
utility of JOL and fMRI observations (Figure 4A).


The Recall+JOL model, which adds judgments of learning to
both the training and evaluation of the Recall model, achieved
a mean held-out ROC-AUC of 0.73 (±.01), improving our
predictions relative to the Recall model. This shows that
metacognitive judgments collected from individuals at the
end of a learning session can be used to refine predictions
about held-out recall performance.


We next assessed whether fMRI signals recorded during study
events could be leveraged to make predictions about
held-out performance. The Recall+MRI model yielded a held-out
ROC-AUC of 0.72 (±.02). Although the held-out performance
did not surpass that of the Recall+JOL model, this result
suggests that the MRI measurements contain information
that can be used to predict held-out recall performance.


4.2 Three-state model 


We next considered whether a more elaborated model of
memory could leverage more subtle dynamics of the fMRI
data.† The held-out ROC-AUCs for the Recall and
Recall+JOL three-state models did not differ from those
observed in the two-state model (Figure 4B). However, the
three-state MRI model boosted the held-out AUC to 0.75
(±.01), which was an improvement compared to the original
two-state Recall+MRI model. This was also, in terms of
held-out predictions, the most successful model we considered
in these comparisons (but see Conclusions), building
confidence in the utility of incorporating neural signals into
knowledge tracing models.


† Although our primary interest in this work is evaluating
the held-out predictions of our models, we note that the
complexity of the three-state model means that three-state
Recall or Recall+JOL variants may not be identifiable due to
the sparseness of observations (a single recall outcome, or
the recall outcome and a single JOL) [37, 16]. However, for
the MRI participants we have data for every trial, enabling
estimation of a three-state Recall+MRI model.
Figure 4: ROC curves for held-out predictions in each of
the two-state (panel A) and three-state (panel B) model
variants (Recall, Recall+JOL, Recall+MRI). The curves
show the mean ± SEM across the cross-validation
folds.


In addition, whereas the Recall and Recall+JOL models did
not discriminate between the two- and three-state models,
the fMRI data enabled better predictions using the
three-state model, highlighting the utility of neuroimaging data
in selecting between cognitive models.


4.3 Relating model dynamics to the brain 

In addition to the improvements in memory prediction af- 
forded by joint modeling of behavioral and neural data, our 
approach also allows for examination of fMRI data in light
of the estimated models. Figure 5 presents two example 
analyses in this vein. 


Figure 5A shows the contrast map resulting from regressing 
the change in posterior probability of s_K associated with
each study trial (as estimated in the two-state Recall model) 
against the fMRI time-series in each voxel. Using the esti- 
mated two-state Recall model parameters, we extracted the 
state posteriors on each study event for the MRI partici- 
pants based on the sequence and timing of study trials. We 
then calculated the change in predicted state posterior from 
just before to just after a study trial and used this change as 
the predictor for brain activations. This analysis is related
to the General Linear Model approach often used in the
subsequent memory literature, except that rather than using
binary regressors that coded for remembered or forgotten
outcomes as determined by a recall test, we used the
estimated continuous state posteriors from the two-state model.


Figure 5: Examples of using the estimated model to analyze
the brain. A) Coronal slice showing left anterior
hippocampal voxels tracking the change in the s_K state
posterior for each study trial. B) Topography (left; axial
slice) and posterior predictive distributions (right; x-axis:
fMRI activation) for fMRI activations from the most
informative component in the three-state model. Individual
traces show the distributions for each fold of the cross
validation.
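The model-based regressor can be illustrated with a simplified sketch:
the change in the posterior probability of s_K across each study trial
is computed and then related to a trial-wise signal by ordinary least
squares. A full GLM analysis would convolve regressors with a
hemodynamic response function and include nuisance terms; those steps
are omitted here, and all values below are synthetic.

```python
import numpy as np


def posterior_change(p_before, p_after):
    """Change in P(s_K) from just before to just after each study trial."""
    return np.asarray(p_after, dtype=float) - np.asarray(p_before, dtype=float)


def fit_slope(regressor, signal):
    """Ordinary least squares slope of a trial-wise signal on the regressor."""
    design = np.column_stack([np.ones_like(regressor), regressor])
    coef, *_ = np.linalg.lstsq(design, signal, rcond=None)
    return float(coef[1])


# Hypothetical posteriors for four study trials of one item.
delta = posterior_change([0.0, 0.4, 0.6, 0.8], [0.5, 0.7, 0.8, 0.9])
signal = 2.0 * delta + 1.0  # synthetic voxel response, for illustration only
slope = fit_slope(delta, signal)
```

Unlike a binary remembered/forgotten regressor, this predictor is
largest on the trials where the model infers that most learning
occurred, which is typically early in a sequence of repetitions.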


Using a knowledge tracing model in this way to provide es- 
timates of when a particular item is learned during a study 
sequence with multiple repetitions allows for more sensitive 
analyses of the brain’s relationship to cognitive processes 
unfolding over extended time. Interestingly, we found that
the voxels significantly correlated with the
change-in-state-posterior regressor formed a cluster in the left
anterior hippocampus, consistent with the hypothesized role
for this region in encoding new information into memory [12].


An alternative way to use the fitted models is to examine the
estimated fMRI features’ observation likelihoods for each
latent knowledge state. The Recall+MRI model included
activation from a number of independent components as
candidate neural features. After estimating the model, the fMRI
observation parameters can be used to assess which
components provided information about the latent model states.
Used in this way, the joint model can be used as a tool for 
understanding how complex cognitive dynamics, especially 
those that might not be apparent in a more conventional 
analysis (e.g., a traditional subsequent memory analysis that 
only considers activation at the time of study and perfor- 
mance at the time of test), are instantiated in the brain. 
The most informative component in our model was associated
with voxels in lateral occipital and fusiform gyrus
regions involved in processing complex visual inputs, as shown
in an axial slice through the brain (anterior/posterior of the
brain is up/down in the image) in Figure 5B. The posterior
predictive distributions for component activation conditioned
on model state are also shown in Figure 5B; these
estimated distributions showed stronger activation for items
in the K or P states relative to U.


5. CONCLUSIONS 


We evaluate a framework for integrating neuroimaging record- 
ings into a knowledge tracing model. Our approach builds 
upon recent reports showing robust memory-related signals 
in the brain. We collected a medium-sized data set of hu- 
man participants performing a second-language acquisition 


Proceedings of the 11th International Conference on Educational Data Mining 226 


task both inside and outside a scanner. We then compared 
a variety of models on their ability to predict held out data 
for the MRI participants. Our most predictive model was a 
three-state hidden Markov model that incorporated neural 
measurements. This is interesting because this model was 
more predictive than alternative approaches that leveraged 
participants’ self-assessment of their learning (JOLs). One 
conclusion from this analysis is that there seem to be mea- 
surable signals in the brain that index the quality of memory 
with higher fidelity than people’s own introspective access. 


We also observed that the use of fMRI measurements
enabled discriminating between models that were equivalent
when using behavioral data (recall or JOL) alone. Whereas
the held-out performance of the two- and three-state models
was the same for the Recall and Recall+JOL model variants,
using fMRI data to inform the model estimation
revealed an improvement for the three- compared to the
two-state model. This result points to the ways in which joint
modeling of behavioral and neural data can afford insights
into cognitive dynamics that might not be available to
researchers focusing on more restricted kinds of data.


Although the results are promising, our assumptions about
the fMRI data at this stage are simplistic. For example, our
model assumed that the distribution of fMRI signals was
stable across time. However, it is well known that fMRI
signals often show a pattern of repetition suppression [18],
where the measured BOLD signal is systematically lower on
subsequent presentations of an item. A more sophisticated
analysis of the brain may lead to improvements in our models.
Another particularly interesting direction is to attempt
to model individual learner abilities (cf. [41, 21]) on the
basis of patterns of brain activity, given the large variance in
overall performance across participants (see Figure 1).


Modifications to the model structure might also improve
predictions. As an example, in ongoing work we estimated the
three-state Recall+MRI model but modeled the fMRI
observations as arising from transitions between states rather
than from the states themselves (i.e., each fMRI component
has a distribution of activations associated with staying in a
state and another distribution associated with switching
between states). The three-state version of this
Recall+MRI-Transition model yielded a held-out AUC of 0.77 (±.02),
which is our best performing model to date. This suggests
that there is more signal we can exploit from the
data by improving our generative model of the fMRI signal.
Attempts to improve the fMRI modeling and explore
different model structures are continuing.
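The idea of transition-dependent observations can be sketched as a
single forward-filtering step in which the emission density depends on
whether the state was kept or changed. This is an illustration of the
general technique for one fMRI component under assumed names and
parameter values, not the implementation used in the ongoing work.

```python
import numpy as np


def gaussian_pdf(x, mu, sigma):
    """Density of a Normal(mu, sigma) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))


def forward_step(alpha, transition, obs, stay, switch):
    """One forward update where the emission depends on the transition taken.

    alpha:       (n_states,) filtered state probabilities before the trial
    transition:  (n_states, n_states) row-stochastic study transition matrix
    obs:         observed activation of one component on this trial
    stay/switch: (mu, sigma) of the activation when the state is kept/changed
    """
    alpha = np.asarray(alpha, dtype=float)
    n = len(alpha)
    # Diagonal entries correspond to staying; off-diagonal to switching.
    lik = np.where(np.eye(n, dtype=bool),
                   gaussian_pdf(obs, *stay),
                   gaussian_pdf(obs, *switch))
    joint = alpha[:, None] * transition * lik  # weight of (previous, next) pairs
    new_alpha = joint.sum(axis=0)              # marginalize the previous state
    return new_alpha / new_alpha.sum()


# Hypothetical two-state example: activation is higher when a switch occurs.
updated = forward_step(
    alpha=[1.0, 0.0],
    transition=np.array([[0.5, 0.5], [0.0, 1.0]]),
    obs=1.0,
    stay=(0.0, 1.0),
    switch=(1.0, 1.0),
)
```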


We have also illustrated several ways in which this kind of
simultaneous modeling approach might feed back into our
understanding of the role of the brain in supporting learning
and memory. Using a model-based regressor coding for the
change in posterior probability of latent knowledge states,
we identified a significant effect in a left anterior
hippocampus region that is known to be involved in memory
formation on the basis of past studies [12]. The similarities
between this novel analysis approach and past cognitive
neuroscience studies give converging evidence about the
hypothesized role of these regions. We also used our estimates
of the fMRI observation distributions to examine the
relationship between fMRI activation arising from different
neural components and the latent knowledge states
instantiated in the model(s), which is a novel approach to
understanding the way psychological mechanisms or processes
may be implemented in the brain.


We acknowledge the practical limitations of acquiring
neuroimaging data in an educational setting, although
advances in EEG technology and the established ability to
measure subsequent memory signals with EEG may enable
such use in restricted settings [17, 14]. Overall, we believe
this work represents an encouraging first step for knowledge
tracing approaches that utilize indirect neural information
as opposed to explicit tests.